AITopics | algerian dialect

chDzDT: Word-level morphology-aware language model for Algerian social media text

Aries, Abdelkrime

arXiv.org Artificial Intelligence, Sep-3-2025

Pre-trained language models (PLMs) have substantially advanced natural language processing by providing context-sensitive text representations. However, the Algerian dialect remains under-represented, with few dedicated models available. Processing this dialect is challenging due to its complex morphology, frequent code-switching, multiple scripts, and strong lexical influences from other languages. These characteristics complicate tokenization and reduce the effectiveness of conventional word- or subword-level approaches. To address this gap, we introduce chDzDT, a character-level pre-trained language model tailored for Algerian morphology. Unlike conventional PLMs that rely on token sequences, chDzDT is trained on isolated words. This design allows the model to encode morphological patterns robustly, without depending on token boundaries or standardized orthography. The training corpus draws from diverse sources, including YouTube comments, French, English, and Berber Wikipedia, as well as the Tatoeba project. It covers multiple scripts and linguistic varieties, resulting in a substantial pre-training workload. Our contributions are threefold: (i) a detailed morphological analysis of Algerian dialect using YouTube comments; (ii) the construction of a multilingual Algerian lexicon dataset; and (iii) the development and extensive evaluation of a character-level PLM as a morphology-focused encoder for downstream tasks. The proposed approach demonstrates the potential of character-level modeling for morphologically rich, low-resource dialects and lays a foundation for more inclusive and adaptable NLP systems.
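
To make the idea of encoding isolated words at the character level concrete, here is a minimal sketch. It is not the chDzDT tokenizer or architecture: the special symbols, the function names `build_char_vocab` and `encode_word`, and the sample words are all assumptions chosen for illustration. It only shows how words in different scripts can be mapped to fixed-length character-id sequences without any subword tokenizer.

```python
from typing import Dict, List

# Hypothetical special symbols; the paper's actual inventory is not given here.
PAD, BOS, EOS, UNK = "<pad>", "<w>", "</w>", "<unk>"


def build_char_vocab(words: List[str]) -> Dict[str, int]:
    """Map every character seen in `words` to an integer id."""
    vocab = {tok: i for i, tok in enumerate([PAD, BOS, EOS, UNK])}
    for ch in sorted({c for w in words for c in w}):
        vocab[ch] = len(vocab)
    return vocab


def encode_word(word: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Encode one isolated word as a fixed-length list of character ids.

    Unknown characters map to UNK. Words longer than `max_len - 2` are
    truncated, which may drop the end-of-word marker.
    """
    ids = [vocab[BOS]] + [vocab.get(c, vocab[UNK]) for c in word] + [vocab[EOS]]
    ids = ids[:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))


# Illustrative words in Arabic and Latin script (assumed examples).
words = ["راني", "rani", "bonjour"]
vocab = build_char_vocab(words)
encoded = [encode_word(w, vocab, max_len=10) for w in words]
```

Because the vocabulary is built over characters, a spelling variant of a known word reuses existing ids instead of falling out of vocabulary, which is the property the abstract relies on for non-standardized orthography.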

arabic, large language model, machine learning, (22 more...)

arXiv: 2509.01772

Genre: Research Report > New Finding (1.00)

Industry: Information Technology (0.67)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
(3 more...)

CAFE A Novel Code switching Dataset for Algerian Dialect French and English

Lachemat, Houssam Eddine-Othman, Abbas, Akli, Oukas, Nourredine, Kheir, Yassine El, Haboussi, Samia, Shammur, Absar Chowdhury

arXiv.org Artificial Intelligence, Nov-20-2024

The paper introduces and publicly releases (data download link available after acceptance) CAFE, the first code-switching dataset between the Algerian dialect, French, and English. The CAFE speech data is unique in that (a) it features a spontaneous speaking style in in vivo human-human conversation, capturing phenomena such as code-switching and overlapping speech; (b) it addresses distinct linguistic challenges of the North African Arabic dialect; and (c) it captures dialectal variations from various parts of Algeria within different sociolinguistic contexts. CAFE contains approximately 37 hours of speech, with a subset, CAFE-small, of 2 hours and 36 minutes released with manual human annotation, including speech segmentation, transcription, explicit annotation of code-switching points, overlapping speech, and other events such as noise and laughter. The remaining approximately 34.58 hours contain pseudo-label transcriptions. In addition to the data release, the paper highlights the challenges that state-of-the-art Automatic Speech Recognition (ASR) models such as Whisper large-v2 and large-v3 and PromptingWhisper face with such content. We then benchmark CAFE with the aforementioned Whisper models and show how well-designed data processing pipelines and advanced decoding techniques can improve ASR performance, reaching a Mixed Error Rate (MER) of 0.310, a Character Error Rate (CER) of 0.329, and a Word Error Rate (WER) of 0.538.
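
For readers unfamiliar with the error rates quoted above, the following is a minimal sketch of how WER and CER are commonly computed: the Levenshtein edit distance between reference and hypothesis, divided by the reference length, over words or characters respectively. This is the textbook definition, not the paper's evaluation pipeline; the function names and the sample sentences are assumptions, and text normalization (which strongly affects reported scores) is left out. MER is not covered because its exact definition is not given in the abstract.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) via DP."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Assumed code-switched example: one word differs by a single character.
ref = "rani f la maison"
hyp = "rani fi la maison"
print(wer(ref, hyp))  # 0.25
print(cer(ref, hyp))  # 0.0625
```

The example illustrates why WER tends to exceed CER on dialectal text with unstable spelling: a one-character spelling difference counts as a whole word error.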

dataset, dialect, speech, (12 more...)

arXiv: 2411.13424

Country:

Africa > Middle East > Algeria > Bouïra Province > Bouira (0.06)
North America > United States > Hawaii > Honolulu County > Honolulu (0.04)
Europe > Germany > Berlin (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Media (0.46)
Law (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

FASSILA: A Corpus for Algerian Dialect Fake News Detection and Sentiment Analysis

Abdedaiem, Amin, Dahou, Abdelhalim Hafedh, Cheragui, Mohamed Amine, Mathiak, Brigitte

arXiv.org Artificial Intelligence, Nov-7-2024

Building a corpus has become an important topic in natural language processing (NLP), especially for low-resource languages (e.g., the Algerian dialect, AD), because corpora play a central role in the development of tools such as machine translation Babaali and Salem [2022], part-of-speech tagging Chiche and Yitagesu [2022], and named entity recognition Jarrar et al. [2022], in particular with the emergence of techniques based on statistics, machine learning, and deep learning, which exploit this mass of information to develop, train, and evaluate models. However, building a corpus is not an easy task Bakari et al. [2016]; it is extremely time-consuming and requires a lot of work, because the volume and quality of the corpus are two important parameters, despite the recent emergence of techniques that consume fewer resources, such as few-shot learning Tunstall et al. [2022]. Over the last few years, many studies in NLP have focused on so-called low-resource languages or language variants Mengoni and Santucci [2023]. This change of direction is mainly due to the emergence of social media such as Facebook, Twitter, RenRen, LinkedIn, Google+, and Tuenti as a means of communication where people exchange messages and comments.

algerian dialect, corpus, dialect, (14 more...)

doi: 10.1016/j.procs.2024.10.214

arXiv: 2411.04604

Country:

Africa > Middle East > Algeria > Adrar Province > Adrar (0.04)
Europe > Germany (0.04)
North America > United States (0.04)
(3 more...)

Genre: Research Report > New Finding (1.00)

Industry: Media > News (0.86)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Hate speech detection in Algerian dialect using deep learning

Lanasri, Dihia, Olano, Juan, Klioui, Sifal, Lee, Sin Liang, Sekkai, Lamia

arXiv.org Artificial Intelligence, Sep-20-2023

With the proliferation of hate speech on social networks in different forms, such as abusive language, cyberbullying, and violence, people have experienced a significant increase in violence, exposing them to uncomfortable situations and threats. Considerable effort has been dedicated in the last few years to countering this phenomenon by detecting hate speech in different structured languages like English, French, Arabic, and others. However, few works deal with Arabic dialects such as Tunisian, Egyptian, and Gulf, and in particular with the Algerian dialect. To fill this gap, we propose in this work a complete approach for detecting hate speech in online Algerian messages. Many deep learning architectures have been evaluated on the corpus we created from several Algerian social networks (Facebook, YouTube, and Twitter). This corpus contains more than 13.5K documents in Algerian dialect written in Arabic, labeled as hateful or non-hateful. Promising results are obtained, which show the efficiency of our approach.

algerian dialect, deep learning, hate speech detection

arXiv: 2309.11611

Genre: Research Report (0.40)

Industry: Information Technology (0.73)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.60)

DziriBERT: a Pre-trained Language Model for the Algerian Dialect

Abdaoui, Amine, Berrimi, Mohamed, Oussalah, Mourad, Moussaoui, Abdelouahab

arXiv.org Artificial Intelligence, Dec-12-2022

Pre-trained transformers are now the de facto models in Natural Language Processing given their state-of-the-art results in many tasks and languages. However, most of the current models have been trained on languages for which large text resources are already available (such as English, French, Arabic, etc.). Therefore, there are still a number of low-resource languages that need more attention from the community. In this paper, we study the Algerian dialect, which has several specificities that make the use of Arabic or multilingual models inappropriate. To address this issue, we collected more than one million Algerian tweets and pre-trained the first Algerian language model: DziriBERT. When compared with existing models, DziriBERT achieves better results, especially when dealing with the Roman script. The obtained results show that pre-training a dedicated model on a small dataset (150 MB) can outperform existing models that have been trained on much more data (hundreds of GB). Finally, our model is publicly available to the community.

artificial intelligence, machine learning, natural language, (16 more...)

arXiv: 2109.12346

Country:

Europe > Finland > Northern Ostrobothnia > Oulu (0.05)
Africa > Middle East > Algeria > Sétif Province > Sétif (0.05)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)

Genre: Research Report (0.85)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.69)
